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Abstract 

Stochastic variational inference (SVI) is emerg¬ 
ing as the most promising candidate for scal¬ 
ing inference in Bayesian probabilistic models 
to large datasets. However, the performance of 
these methods has been assessed primarily in the 
context of Bayesian topic models, particularly 
latent Dirichlet allocation (EDA). Deriving sev¬ 
eral new algorithms, and using synthetic, image 
and genomic datasets, we investigate whether the 
understanding gleaned from EDA applies in the 
setting of sparse latent factor models, specifi¬ 
cally beta process factor analysis (BPFA). We 
demonstrate that the big picture is consistent; us¬ 
ing Gibbs sampling within SVI to maintain cer¬ 
tain posterior dependencies is extremely effec¬ 
tive. However, we find that different posterior 
dependencies are important in BPFA relative to 
EDA. Particularly, approximations able to model 
intra-local variable dependence perform best. 


1. Introduction 

The last two decades have seen an explosion in the de¬ 
velopment of flexible statistical methods able to model di¬ 
verse data sources. Bayesian nonparametric priors in par¬ 
ticular provide a powerful framework to enable models to 
adapt their complexity to the data at hand (Orbanz & Teh, 
2010). In the regression setting this might mean learn¬ 
ing the smoothness of the output function (Rasmussen & 
Williams, 2006), for clustering adapting the number of 
components (MacEachern & Muller, 1998), and in the case 
of our interest, latent factor models, finding an appropri¬ 
ate number of latent features (Knowles et al., 2011). While 
such models are appealing for a range of applied data anal- 
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ysis applications, their scalability is often limited. The 
posterior distribution over parameters and latent variables 
is typically analytically intractable and highly multimodal, 
making MCMC, particularly Gibbs sampling, the norm. 
Along with concerns over performance and convergence, 
MCMC methods are often impractical for the applied prac¬ 
titioner: how should the multiple samples be summarized? 
Variational methods work on the basis that simply finding 
a good posterior mode, and giving some measure of the 
associated uncertainty, is typically sufficient. In addition, 
the predictive performance of variational methods is often 
comparable to more computationally expensive sampling 
based approaches (Ghahramani & Beal, 1999). 

Recently stochastic variational inference has begun to 
emerge as the most promising avenue for scaling inference 
in large latent variable models (Hoffman et al., 2013). Mar¬ 
rying variational inference with stochastic gradient descent 
allows principled updates using only minibatches of obser¬ 
vations, greatly improving data scalability. While some 
MCMC methods have been proposed to work with mini¬ 
batches (Welling & Teh, 2011; Ahn et al., 2012) they lack 
theoretical guarantees and apply only to continuous, un¬ 
bounded latent variables. While SVI has been influential 
for Bayesian topic modeling, particularly latent Dirichlet 
allocation (EDA, Mimno et al., 2012; Hoffman & Blei, 
2014; Wang & Blei, 2012), the same cannot be said for 
sparse factor analysis models for continuous data. While 
the former has been driven by the ready availability of huge 
text corpora, the scale of continuous data being generated 
by new genomic technologies is still growing. For example, 
CyTOF, single cell time of flight mass spectrometry is able 
to measure the abundance of dozens of proteins in hundreds 
of thousands of cells in a single run (Bendall et al., 201 1). 
Such complex, large scale, high dimensional datasets re¬ 
quire sophisticated statistical models, but are typically ana¬ 
lyzed using simple heuristic clustering methods or PCA, 
which do not capture important structure, such as spar¬ 
sity. As a result, scaling more advanced factor analysis type 
models is of great interest. 
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When desigining a variational approximation to a poste¬ 
rior distribution, one must trade off the accuracy of the 
approximation with the complexity of optimizing the ev¬ 
idence lower bound. Mean field approximations are sim¬ 
ple to work with, but Hoffman & Blei (2014) demonstrated 
that in the context of LDA, maintaining posterior depen¬ 
dence between “global” variables (topic vectors) and “lo¬ 
cal” variables (document vectors) is crucial to finding good 
solutions. Does this finding hold for sparse factor analy¬ 
sis models? Our results suggest that contrary to the LDA 
case, maintaining dependencies amongst local variables is 
actually the most important ingredient for obtaining good 
performance with beta Bernoulli process SVI. 

2. Beta Process for Factor Analysis 

The beta process (Hjort, 1990; Thibaux & Jordan, 2007) is 
an independent increments process defined as follows: 

Definition 1. Let LI be a measurable space and B its a- 
algebra. Let Hq be a continuous probability measure on 
{Ll,B) and a a positive scalar. Then for all disjoint, in¬ 
finitesimal partitions, {Bi,B k}, of LI the beta process 
is generated as follows, 

H{Bk) “ Beta(a7Lo(Sfc), a(l - Ho{Bk))) (1) 

with AT —>■ oo and HQ{Bk) —>■ 0 for k = l,...,Ar. We 
denote the process H ^ BP(ai7o)- 


Chinese restaurant process, a marginalized approach used 
for sampling from the Dirichlet process, there exists an 
efficient marginalized scheme for sampling from the beta 
process, called the Indian buffet process (IBP, Griffiths & 
Ghahramani, 2006; Thibaux & Jordan, 2007). 

The IBP sampling procedure introduces strong dependen¬ 
cies between the rows of Z. Our goal is to derive a stochas¬ 
tic variational inference scheme where we consider rows in 
batches. It will hence be crucial to instantiate the global 
parameters rather than marginalize over them. 

For this reason, we shall consider a finite approximation 
to the beta process which simply set iT to a large, finite 
number. The finite representation is written as 

K 

H{uj) = '^TrkS^^{uj) 
fe=i 

TTfe ^ Beta(a/iT, b{K - l)/K), uju ^ Ho (3) 

and the AT-dimensional vector, Zi, is drawn from a finite 
Bernoulli process parameterized by H. 

Consider modelling a data matrix Y G where rows 

represent data points. Factor analysis models this data as 
the product of two matrices L G and €> G 

plus an error matrix, E. 

Y = L^ + E (4) 


Hjort considers a generalization of this definition including 
functions, a{Bk), which we set as constants for the sake 
of simplicity. Analogous to the Dirichlet process, the beta 
process may be written in set function form as 

OO 

H{lo) = ^ (^) (2) 

k^l 

with H{uji) = TTi. Note that the beta process is not a nor¬ 
malized random measure. Hence the tt of a beta process 
does not represent a probability mass function on Lt, but 
instead can be used to parametrize the Bernoulli process, a 
new measure on LI defined as follows: 

Definition 2. Let zi be an infinite row vector with kf^ 

value, Zik, generated by Zik\T^k ~ Bernoulli( tt/j). The 
measure defined by Xi(uj) = (^) o draw 

from a Bernoulli process, which we denote Xi ~ BeP(L7). 

If we were to stack samples of the infinite-dimensional vec¬ 
tor, Zi, to form a matrix, Z — [z^^, we may 
view the beta-Bemoulli process as a prior over infinite bi¬ 
nary matrices (Griffiths & Ghahramani, 201 1), where each 
column in the matrix Z corresponds to a location, Sui- 

Sampling H directly, as defined in (2), is difficult to do ex¬ 
actly and efficiently. But, just as Aldous (1985) derived the 


Prior belief about the structure of the data may be used 
to induce the desired propeties of L and $, e.g. sparsity 
(West, 2003; Rai & Daume, 2008; Knowles & Ghahra¬ 
mani, 2007). To encourage sparsity, we model L as the 
Hadamard (element-wise) product between matrices Z and 
W, L = Z o W, where Z is binary and W is a Gaussian 
weight matrix. This idea is described in Section 3 of (Grif¬ 
fiths & Ghahramani, 201 1). We model the matrices $ and 
Z as N draws from a beta-Bemoulli process parameterized 
by a beta process, H. 

Using the truncated beta process of (3), we have the fol¬ 
lowing generative process for observation i = 1 ,..., N and 
features k = 1,K, 

yi = {ziowi)^ -f Ei 
Wi 

ZiklT^k ^ Bernoulli(7rfc) 

ei -AA(0,7-J/) 

TTk ~ Beta(a/Ar, b{K — 1)/K) 


(5) 

Local variables 

Global variables 


where all values are drawn independently. This is the gen¬ 
erative model used for beta process factor analysis (Pais¬ 
ley & Garin, 2009). We place independent Gamma(c', d') 
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and Gamma(e', f) priors on 7obs and 7^, respectively. The 
separation of local and global variables will be crucial for 
the stochastic variational inference algorithm which we de¬ 
rive in the next section. For the sake of brevity, we denote 
the set of global variables /3 = {tt, 4*,7u,,7obs} and sets 
of local variables -i/jj = {wi, Zi} for i = 1 ,..., N. 

3. Variational Inference Schemes 

The true posterior distribution p{P,ipi.pf\xi.N) involves 
complicated dependencies between latent variables, which 
makes inference complicated. The goal of variational in¬ 
ference is to approximate the true posterior with a family 
of distributions q{(3 , ijj i.^). We choose the best mem¬ 
ber of the chosen family of distributions by minimizing 
the KL-divergence between this variational distribution and 
the true posterior. Equivalently, we maximize the evidence 
lower bound (ELBO), 

C{q) = Eq[\ogp{f3,'ip^.N,xi:N) - logq{f3,ip^,ff)]. (6) 

In this work, we compare the performance of a range of 
variational approximations. Each of the approximations we 
consider factorizes as follows 

9(^,'0i:Ar) = (qhohs)qhw)Y[qiTTk)qi(t>k))Q{^i-.N\f^), 

k 


where is an expectation over all latent variables with 
respect to q (except when global parameters are being sam¬ 
pled), and = Vi - ^q[Y.j^k We work 

with the natural parameters of the global variational dis¬ 
tributions. Natural gradients give the direction of steepest 
ascent in Riemannian space, leading to faster convergence 
for e.g. maximum likelihood estimation (Amari, 1998). 

Our aim is to update the global variational parameters 
stochastically, by considering subsets of the full dataset 
and making sequential updates. Let rj denote the vec¬ 
tor of global natural parameters of p{/3), and by condi¬ 
tional conjugacy, the vector of global natural parameters 
of q{(3) is r] + J2i '*/’*)■ In fact, y = [a/K, b{K - 
i-)/K, c', d', e', /', D, 0], and is a vector consisting of 
each of the elements of the sums in Equation 7. We 
have followed the notation of Hoffman & Blei (2014). The 
general framework of stochastic variational inference we 
shall follow is summarized in Algorithm 1 . 

The difficult step is in computing f)^, and it is entirely de¬ 
pendent on the form of the local variable approximation, of 
which we consider 2 types: ‘Unstructured’ methods where 
q{^Pl.j^\(3) = q{ilti.pf) and ‘structured’ methods where 
this equivalence does not hold. Our notion of ‘structure’ 
describes the dependence between local and global vari¬ 
ables, as discussed by Hoffman & Blei (2014). 


where qi'johs) = Gamma(c, d), q^jw) = Gamma(e,/), 

and q(7rfc) = Beta(afe,6fe). 
The global variational distributions are all of the same ex¬ 
ponential family forms as their posterior conditional dis¬ 
tributions. The full set of global variational parameters is 
A = {uk, bk, c, d, e, /, r*, /i^}. Due to conjugacy of our 
model, it is easy to show that the updates for the global 
variational parameters during the variational M-step are as 
follows 

Gk = a/K+ '^Eq[z^k] (7) 

i 

h = b{K -l)/K + ^{l- Eq[zik]^ 

i 

c = c' + J2 ^/2 


d = d' + Y^lEq 

i 




e = e + KI2 

i 

f = f 


— D 'y ^ Eg [7obs'2'ifc'tUifc ] 
i 

l^k ='^^q[z^kW^ky~’'] 


3.1. Unstructured Variational Methods 

The simplest, and most commonly used, approximation we 
can make is the mean field approximation, 

K 

QMFi'iPi) = q{zik)q{'w^k) ( 8 ) 

k^l 

where q{zik) = Bernoulli(0ifc) and q{wik) = 
■^i'^7k’Zik,K~i}). Given the current set of global 
parameters, the local ELBO, Eiocai = 

Eg(/3)gMp(^^,„)[logp(yi,^,r/ji^^|/3) - logg(r/ji,^)] 

is optimized as a function of local variational parameters 
{dik, Vik: Kifc}- Up to irrelevant constants. 


C 


local 


= -F 

2d ^ 


i.k 


^ik 


^ik 


Oft 

^^ik Vi 

^ik '^k 

1 \ (iJ-ki4 , 1 


ftik J V '^k‘^ 


Tk 


E a ^ l^ij Vik 

aijOik - 

■ ^ K] Kik TjTk 


Tn 




^ik 

^ik^ 


+ F - '0(&fe)) - 9 F 


i.k 


- F log Oik + (1 - Oik) log(l - Oik)] 


1 

^ik 


(9) 
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where ijj is the digamma function. The mean field approx¬ 
imation breaks dependencies between all local and global 
variables, and will provide a baseline to compare against. 
It is possible to compute [ryj analytically given 

the optimized local variational parameters. We denote the 
SVI algorithm which uses a mean field local variable ap¬ 
proximation as MF-SVI. It is identical to the original SVI 
algorithm introduced by Hoffman et al. (2013). 

Mimno et al. (2012) suggested an online SVI method which 
maintains structure between local variables specifically for 
the LDA. We generalize their idea by suggesting the fol¬ 
lowing variational distribution over local parameters 

gMimno('0i) = exp(E,(^)[logp(t/j,|yi.^,/3)]) (10) 

where p{'ip^\yi.jsj, f3) is the true posterior conditional of 
■j/ij. Whilst we are unable to compute [ryj ana¬ 

lytically, we are able to estimate it using MCMC. The SVI 
algorithm using gMimno as the local variational distribution 
shall be called Mimno-SVI. 

3.2. Structured Variational Methods 

Instead of taking an expectation over q{(3) to compute f]^ 
as in the previous section, we use the current set of global 
parameters to draw a sample and compute an 
estimate of E^^^ Under this framework. Algo¬ 

rithm 1 becomes the SSVI-A algorithm of Hoffman & Blei 
(2014). 

Once again, the simplest approximation that can be made is 
the conditional mean-field approximation, where , Wik 
are independent given /3, with q{zik) — Bernoulli(0ifc) 
and q{wik) = This time, we optimize 

the local ELBO, [logp(yi,jv> - 

log as a function of local variational parameters 

{0^k,Vik,K^k}, and compute [ryj analyti¬ 

cally given the optimized parameters. We shall call this 
SVI method MF-SSVI. 

The Bernoulli-Gaussian products present in the generative 
process in Equation 5 can be thought of as a spike-and- 
slab model. Titsias & Lazaro-Gredilla (2011) developed a 
variational method which maintains dependence between 
Zik and Wik for eack k, such that 

= ]^Bernoulli(zifc;0jfc) (11) 

k 

X J\f i^Wik ^ ^ikk^^k ^ik^ Z^ikk^j^k T (1 ^ ) • 

This approximation has the advantage that it maintains the 
spike-slab beaviour of the product ZikWik, and matches the 
exact posterior when Zik = 0. However, the dependen¬ 
cies between local variables for which fc fc' are lost. 
Analogous to ME-SSVI, we optimize the local ELBO us¬ 
ing gTitsias as a function of {0ik,v^k, Kik}, and compute 


Algorithm 1 General Stochastic Variational Inference 

Initialize t = 1, 

repeat 

Compute step size = {t + 

Select subset of full data set, T). 

Compute f)j, an (unbiased) estimator of Eq(.0j^) [ryj 
for each i G V 

Set A(‘) = (1 - p(‘))A(‘-i) + p(‘) (r, + ^ 
until convergence 


analytically given the optimized pa¬ 
rameters. We denote the SVI algorithm which uses the 
Titsias & Lazaro-Gredilla (2011) local approximation as 
Titsias-SSVI. 

Finally we consider using the exact local conditional dis¬ 
tribution given by q{^p^\f3^*'^) = We use 

MCMC samples to compute f]^ using a Gibbs sampling 
scheme. We therefore call this method Gibbs-SSVI. 

MF-SVI (Hoffman et al., 2013), MF-SSVI, Gibbs-SSVI 
(Hoffman & Blei, 2014) and Mimno-SVI (Mimno et al., 
2012) have been considered in the context of LDA before, 
but the latter 3 have not been applied to factor analysis to 
the best of our knowledge. Titsias-SSVI is a new method as 
Titsias & Lazaro-Gredilla (2011) applied their variational 
approximation only to regression tasks. More details on the 
variational approximations over local variables is provided 
in the appendix. 

4. Related Work 

The idea of applying variational inference to the Indian 
buffet process was first proposed in Doshi et al. (2008), 
based on the stick breaking construction of the IBP (Teh 
et al., 2007). Promising results were shown for the sim¬ 
ple but somewhat limited “linear Gaussian” model, which 
is the model presented here without the weight vector, Wi. 
Paisley & Carin (2009) consider the simpler finite approx¬ 
imation to the beta process described above, and extended 
the model to include continuous weights Wi. An exten¬ 
sion using power-EP, able to handle non-negativity con¬ 
straints, was developed in (Ding et al., 2010) but has not 
been widely adopted. Alternative approaches to scale in¬ 
ference in IBP based models have included parallelization 
(Doshi-Velez et al., 2009) and submodular optimization 
(Reed & Ghahramani, 2013). The former only performed 
approximate sampling, and the later is greedy and limited 
to positive weights. Mean field based stochastic variational 
inference schemes have been used for large scale dictio¬ 
nary learning, with some success (Li et al., 2012; Polatkan 
et al., 2014). However, we shall show that preserving de¬ 
pendencies between local variables will greatly improves 
performance on image interpolation and denoising tasks. 
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Figure 1. Predicitve loglikelihood versus training time on syn¬ 
thetically generated data, comparing Gibbs-SSVI, MF-SSVI and 
Gibbs sampling. The same legend is used throughout this paper. 


Figure 2. Predicitve loglikelihood versus training time on synthet¬ 
ically generated data using Gibbs-SSVI with burn-in lengths of 
0,1,3,5,10 and 25. Converged predictive loglikelihood is mono- 
tonically increasing in bum-in length, so no legend is included. 


Meanwhile the topic modelling community has taken great 
strides developing stochastic variational inference methods 
for latent Dirichlet allocation (Blei et al., 2003), encour¬ 
aged by the availability of large corpora of text. The idea 
was initially proposed in Hoffman et al. (2010), and refined 
in Mimno et al. (2012) where the sparse updates of Gibbs 
sampling were leveraged to scale inference on just a single 
machine to 1.2 million books. The latter idea allows non- 
truncated online learning (Wang & Blei, 2012) of Bayesian 
non-parametric models, though only the hierachical Dirich¬ 
let process (Teh et al., 2004) was demonstrated. 

More recently, Hoffman & Blei (2014); Liang & Hoff¬ 
man (2014) have shown that sampling from the global 
variational distribution improves predictive performance 
for the LDA and Bayesian non-negative matrix factoriza¬ 
tion respectively. In fact, the idea of optimizing an in¬ 
tractable variational inference algorithm by sampling from 
global variational distributions has been proposed in var¬ 
ious contexts to deal with non-conjugacy (Ji et al., 2010; 
Nott et al., 2012; Gerrish, 2013; Paisley et al., 2012; Ran- 
ganath et al., 2014). Kingma & Welling (2014); Titsias 
& Lazaro-Gredilla (2014); Salimans & Knowles (2013) 
propose change of variable methods to deal with non- 
conjugacy or improve convergence speed. In this work we 
focus more on the quality of the variational approximation 
and attempt to exploit the conditional conjugacy. 

5. Experiments 

In this section we discuss our findings from a range of ex¬ 
periments. Results from experiments carried out on syn¬ 
thetically generated data are discussed first. We apply a 
range of stochastic variational inference algorithms to carry 
out image inpainting and denoising tasks next. Finally the 


same algorithms are applied to two large genomic datasets. 
We choose to compare our models using predictive loglike¬ 
lihood of held out data, which we compute as 

p(y\y) « J 

-| Af Attest 

m—1 i—1 

( 12 ) 

where are independent samples 

from q, for whichever type of variational approximation is 
being used, and is the data point in the test set. 

In each of our experiments, we transform the data to have 
empirical mean 0 and variance 1. Hyperparameters are set 
as follows: a = 5 = 10, c = 1, c? = 10, e = / = 1, and a 
learning rate schedule of pt = is employed. 

5.1. Synthetic Data 

A key question that is important to consider when using 
variational approximations of a particular form is, ‘how 
close is the approximation to the true posterior?’. We at¬ 
tempt to answer such a question with our first experiment. 
Data was generated from our prior with parameters 7 ^, = 1, 
7 „bs = 100, K = 80, N = le5 and D = 40, with 7.5% 
selected uniformly at random held out for testing on 100 
independent experiments. We then applied Gibbs-SSVI 
and MF-SSVI, as well as an uncollapsed Gibbs sampler to 
the generated data using K = 150 potential features and 
random initialization. The predictive mean squared errors 
(MSB) of the 3 methods were 0.022 ±0.002, 0.027± 0.004 
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Figure 3. Results from interpolation of the 512 x 512 pixel ‘Boat’ image. PSNR vs training time shown using (a) Gihbs initialization 
and (b) random initialization. The pictures shown are the original image (top left), the image to be reconstructed with 80% of pixels 
unobserved (top right), the Gibbs-SSVI reconstruction (bottom left) and the MF-SSVI reconstruction (bottom right). 


and 0.020 ± 0.002 and the average per iteration training 
times were 1.5, 0.6 and 7.6 seconds respectively. Figure 1 
illustrates our findings. The Gibbs sampler achieves a high 
predictive likelihood, but the average training time per iter¬ 
ation was very high versus the SVI methods. Warm starting 
the Gibbs sampler at each iteration helped Gibbs-SSVI to 
converge in few iterations, whereas stochastically choos¬ 
ing subsets of data points in SSVI methods requires re¬ 
initializing local variables at each epoch. Notice that the 
predictive MSB of Gibbs-SSVI is close to the Gibbs sam¬ 
pler, suggesting that the correct mean is being learnt. The 
lower likelihood the SSVI methods are able to achieve is 
therefore due to a poor calibration in posterior variance, a 
known issue with variational methods (Consonni & Marin, 
2007). 

Another question of interest to us was, ‘what is the em¬ 
pirical trade-off between training time and unbiasedness in 
the Gibbs-SSVI scheme?’. More specifically, if we allow 
the Gibbs sampler over the local variables to converge, the 
subsequent ELBO gradient estimates would be unbiased, 
whilst using samples from a Gibbs chain which has not 
converged would lead to biased gradient estimates. Con¬ 
vergence of the Gibbs chain, however, may take a long 
time. Therefore we experimented with a range of burn-in 
lengths of the Gibbs chain on synthetically generated data. 
Various burn-in and sample length combinations were dis¬ 
cussed by Mimno et al. (2012). We tried burn-in lengths 
of 0,1,3,5,10 and 25 whilst fixing the number of samples 
used after burn-in to 3. The results can be seen in Figure 2. 
When the bum-in length is below 3 we notice severe loss 
in predictive power of the Gibbs-SSVI method. We notice 
diminishing gains in predictive power as we increase the 
length of burn-in. This experiment suggests that some bias 
introduced by using samples from an unconverged Gibbs 
chain may be worth the reduction in training time. For sub¬ 


sequent experiments, we fix the bum-in length to 3. 

5.2. Image Interpolation and Denoising 

Zhou et al. (2009) first applied the beta process for sparse 
image representation with good results and much follow 
up research. The standard metric used for quantifying the 
quality of a reconstructed image is the peak signal-to-noise 
ratio (PSNR), defined as 20 log 2 o(niaxi„iage/rmse), where 
maxiinage IS the maximum possible pixel value and rmse is 
the root mean squared error of the reconstmction. 

We consider overlapping 8x8 pixel patches as individ¬ 
ual 64 dimensional data points. The fact that the patches 
are overlapping technically breaks the exchangeability as¬ 
sumption of the prior distribution, however the extra model 
averaging is benehcial to prediction. Five grayscale im¬ 
ages originally from Portilla et al. (2003) were used for our 
study; Boat, Barbara, Fena, House and Peppers. The first 3 
are 512 x 512 in size whilst the last 2 are 256 x 256. The 
datasets are therefore of size N = (512 — 7)^ = 255,025 
andA^ = (256-7)2 = 62,001 for 512x 512 and 256x256 
images respectively. We use a batchsize of Agubset = 250 
and K = 250 features for our experiments. 

For our hrst experiment, we consider the task of image in¬ 
terpolation, where the task is to reconstract an image where 
only 20% of the pixels, chosen uniformly at random, are 
observed. Fi et al. (2012) consider a mean field based vari¬ 
ational approximation for such a task, however, the learn¬ 
ing rate schedule they used was pt = {t + 1000)“° ®. This 
implies pt < 0.032 for all f > 1, and that their algorithm 
relied heavily on the initialization of global parameters. 
They ran an MCMC algorithm over a subset of the data 
to initialize these global parameters, and we argue that this 
was integral to the performance of their algorithm. We de¬ 
cided to test how sensitive the variational algorithms were 


























Empirical Study of SVI for the Beta Bernoulli Process 



Figure 4. Results from interpolation and denoising of the 512 x 
512 pixel ‘Barbara’ image. The pictures shown are the original 
image (top left), the image to be reconstructed with 50 % of pixels 
unobserved and remaining pixels corrupted with Gaussian noise 
(top right), the Gibbs-SSVI reconstruction (bottom left) and the 
MF-SSVI reconstruction (bottom right). 

to different initialization methods and an example can be 
seen in the performance graphs of Figure 3. We found that 
initializing using MCMC improved the PSNR of MF-S VI, 
MF-SSVI and Titsias-SSVI by 4.8 on average versus ran¬ 
dom initialization. However, the analogous improvement 
for Gibbs-SSVI and Mimno-SVI was 1.0. This suggests 
that the methods which preserve intra-local variable struc¬ 
ture are less sensitive to initialization. 

Secondly, we considered the joint task of image interpo¬ 
lation and denoising. Here, we observe 50 % of the pix¬ 
els chosen uniformly at random, except they are now cor¬ 
rupted with Gaussian noise with standard deviation 15 (the 
original pixels take integer values in [0, 255]). Results for 
both image interpolation and denoising tasks are summa¬ 
rized in Table 1. Gibbs-SSVI and Mimno-SVI consistently 
outperform the other methods and an explanation, outlined 
in Section 5.3, as to why this is the case can be deduced by 
studying the images in Figures 3 and 4. 

For the 512 x 512 images, the average training time per 
epoch was 0.12,0.12,0.13,0.46 and 0.48 secs for the 
MF-SVI, MF-SSVI, Titsias-SSVI, Mimno-SVI and Gibbs- 
SSVI methods respectively, on a 2.4GHz dual core ma¬ 
chine. Among the multiple experiments, the MF-SSVI re¬ 
constructions were similar in appearance to the MF-SVI 
and Titsias-SSVI methods, whilst the Gibbs-SSVI recon¬ 
structions were similar to the Mimno-SVI ones. We be¬ 
lieve the blurred appearance of the former 3 methods’ re¬ 
constructions is a result of the independence between z^k 
and Znk' for k ^ k'm their variational forms. In contrast, 
the Gibbs-SSVI and Mimno-SVI methods maintain depen¬ 
dence between Znk and Znk ', and are therefore able to select 
a subset of features which collectively best explain the data. 
The latter methods are consequently much more capable of 


•lO® 



Figure 5. Predicitve loglikelihood versus training time on cell line 
data comparing five SVI algorithms. 


capturing structure and detail in the images we tested on, 
and there is a mild cost to pay in extra training time. 

5.3. A Thought Experiment 

To illustrate the problem with breaking dependencies be¬ 
tween Zik and Zik> for k ^ k', we can consider a simple 
thought experiment. Suppose = f+ei, G K^, where 
/ A/'(0, 1) and ~ A/'(0,0.057) independently for i = 

1,..., N. Now consider applying MF-SSVI and Gibbs-SVI 
algorithms to this dataset, whilst hxing Wik = 1 for each 
i, k, and using K = 2 features. Let’s assume that at the cur¬ 
rent iteration and 0^*^ « ^ 2 *^ « /. The local 

ELBO for the MF-SSVI method will have local optima for 
Oil = ^ — Si 2 , since exactly 1 feature is needed to explain 
the data, so MF-SSVI would hnd a local optimum of the 
form q{zii) = Bernoulli(s), q{zi 2 ) = Bernoulli(l — s) 
for some s G (0,1). (We were able to verify this form of 
local optimum empircally). The Gibbs-SSVI will generate 
samples of the form Zi = (0,1) and Zi = (1,0). 

We would like to predict with each model. Since q{zii) 
and q{zi 2 ) are independent under MF-SSVI, predictions 
will be yj = 0 with probability s(l — s), y^ « y^ with 
probability s)^ andhnally y.j « 2yj with probabil¬ 

ity s(l — s). Conversely, the Gibbs samper in Gibbs-SSVI 
will place little to no probability on both, or neither of the 2 
features being used for prediction. In summary, the Gibbs 
based local variable estimates can handle the strong corre¬ 
lation between the 2 features, whilst the MF based method 
cannot and suffers dramatically because of it. 

One possible solution to this problem would be to encour¬ 
age all features to have limited correlation apriori, discour¬ 
aging situations where multiple features are learned to be 
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Table 1. PSNR performance of image interpolation (left entries) and denoising (right entries) tasks using Gibbs initialization of global 
parameters on a randomly chosen subset of data. 



Boat 

Barbara 

Lena 

House 

Peppers 

MF-SVI 

21.1 

19.5 

21.8 

20.6 

24.1 

23.6 

25.3 

24.2 

25.9 

24.4 

MF-SSVI 

22.3 

20.8 

22.2 

21.4 

24.7 

24.4 

26.7 

25.4 

25.8 

24.1 

Titsias-SSVI 

23.2 

21.5 

22.1 

21.7 

26.3 

25.8 

26.7 

25.3 

27.9 

26.8 

Mimno-SVI 

32.4 

29.7 

36.2 

35.1 

39.4 

36.9 

42.8 

40.1 

43.7 

40.4 

Gibbs-SSVI 

34.3 

31.5 

38.2 

37.0 

43.3 

41.7 

40.5 

37.8 

47.4 

42.3 


similar to each other. Encouraging dissimilarity in such 
a way is challenging and would complicate the otherwise 
clean updates that are possible in SVI methods. 

5.4. Genomic data 

Vast amounts of genomic data are currently being collected 
as technology advances. It will be crucial to develop ma¬ 
chine learning models and more importantly, inference al¬ 
gorithms, which can cope with large data sets, whilst still 
retaining flexible modelling ability. We consider 2 datasets 
for which sparse latent feature modelling is appropriate. 
We use K = 500 features for both experiments. 

Cancer cell line data. The Cancer Cell Line Encyclope¬ 
dia is a collection of around 450 cancer samples including 
gene expression, copy number variation, and drug response 
information. We focus on modeling the gene expression 
data, which has measurements for around 15,000 genes. In 
this setting we are more interested in finding overlapping 
clusters (sparse features) of genes rather than samples, so 
we effectively have N = 15000, D = 450. The latent fac¬ 
tors found can then be interpreted as biological pathways, 
or sets of genes regulated by the same transcription factor. 
Understanding the structure in this data is valuable as a first 
step towards associating the cellular characteristics of the 
cancers to their drug response profiles. We randomly hold 
out 10 % of the data for testing. Results for this experiment 
are summarized in Eigure 5. 

CyTOF data. CyTOE is a novel extremely high through¬ 
put technology capable of measuring up to 40 protein abun¬ 
dance levels in thousands of individual cells per second. 
The cells are controlled using flow cytometry and spe¬ 
cific proteins are tagged using heavy metals which can be 
measured using time-of-flight mass spectrometry. Exist¬ 
ing analyses have attempted to group the observed cells 
into non-overlapping subpopulations, but we here show 
that the data can be effectively modeled as compromising 
of a spectrum of cell types expressing different latent fac¬ 
tors to differing extents. The sample we analyse consists 
of human immune cells, so representing the heterogeneity 
is relevant for understanding disease response. Our dataset 
has N = 532, 000, D = 40 and a random 5% is used as 


test data. The results for the experiments on this data fol¬ 
low a very similar pattern to that of the cell line gene ex¬ 
pression data. The converged predictive log-likelihoods af¬ 
ter training for 10 minutes are —l.leG, —9.6e5, —9.4e5, 
-3.8e5 and -3.2e5 for the ME-SVI, ME-SSVI, Titsias- 
SSVI, Mimno-SVI and Gibbs-SSVI methods respectively. 

6. Conclusions 

In this work, we compare various stochastic variational in¬ 
ference algorithms for beta process factor analysis. Whist 
many methods in the literature have been proposed, we 
have chosen to exploit the conditional conjugacy and the 
exponential family nature of our model to create simple 
natural parameter updates. 

Hoffman & Blei (2014) found that preserving structure be¬ 
tween local and global variables significantly boosted per¬ 
formance for the EDA, but based on our experiments, we 
conclude that preserving intra-local variable dependence is 
crucial to prediction in the beta-Bernoulli process. This is 
evident from the fact that both Gibbs-SSVI and Mimno- 
SVI consistently and significantly outperform ME-SVI, 
ME-SSVI and Titsias-SSVI on a variety of image interpo¬ 
lation and denoising tasks and on modelling genomic data. 
The Titsias-SSVI method models dependence between Zik 
and Wik, but does not appear to significantly outperform 
ME-SSVI, suggesting that this dependence is not crucial 
in prediction. Mimno-SVI does not maintain dependence 
between local and global variables whilst ME-SSVI does, 
and yet Mimno-SVI leads to better predictions. We discuss 
why this is the case in a simple thought experiment, show¬ 
ing the benefit of maintaining dependence between local 
variables where k ^ k'. The multi-cluster, sparse nature 
of the beta-Bernoulli process makes mean held type local 
variable approximations highly sensitive to correlated fea¬ 
tures. Gibbs-SSVI does also modestly outperform Mimno- 
SVI through maintaining dependencies between global and 
local variables. 

In summary, care is needed to ensure that the dependen¬ 
cies encoded by a particular variational approximation are 
appropriate for the model being considered. 
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Appendix 

Here, we provide details about the local variable approximations introduced in the main text of the paper. 


Mimno-SVI 


The form of the local approximation in the Mimno-SVI method is 


loggMimno(t/’*) = E,(^) [logp(t/?Jyi.jv,/ 3 )] 
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It is clear that log yMimno is quadratic in each Wik and linear in each Zik, therefore a Gibbs based sampler can easily be 
consructed to sample from (/Mimno, where Wik is Gaussian given all other local variables, and Zik is Bernoulli given all 
other local variables. 


MF-SSVI 

The local ELBO in the MF-SSVI framework is very similar to that of the MF-SVI, the difference being that samples of the 
global variables are used in MF-SSVI. The local EFBO has the following form 
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This is optimized as a function of {Oik, ^ik, Hik using gradient descent. Once a local optimum is found, „|/ 3 (‘)) [tlj 

can be computed analytically as a function of the optimized parameters and global variable samples. 

Titsias-SSVI 

Recall that the Titsias-SSVI method maintains dependence between Zik and Wik for each k. The local EFBO for Titsias- 
SSVI is 
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Again, this function is maxized as a function of Kik using gradient descent, and the optimized parameters along 

with the global variable samples are used to compute j„|/ 3 (*)) [t7i] analytically. 


Gibbs-SSVI 

The Gibbs-SVI method uses the true posterior conditional distribution for local variables 


loggGibbs(t/’*) = logp(t/?Jyi.^,/3) 


Tobs I 


y^ - {z^oWi)^\\ +^2^fclog( — 


'^k 


1 - TTfe 


7ju T I i 

- —WiW^ + const 


7obs 


'^ZikWik4>k Wik4>l + ( ] -‘^yj 

k ^ ^ j^k 


+ Zik log 


'^k 


1 - TTfe 


Tu; T I j. 

- —WiW^ + const 


Just as was the case with Mimno-SVI, we notice that log gcibbs is quadratic in each Wik and linear in each Zik, therefore a 
Gibbs sampler can be designed to sample from gcibbs- 




